Lab 4: GAN Example Fixing

Team: Mike Wisniewski, Henry Lambson, Alex Gregory

Compare this implementation to the one from the official PyTorch tutorial: https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html

Author's Note: We began this lab in late March and thus were unaware of the option to choose the Keras implementation until much later. Although Keras is our stronger suit for neural network implementation, we did not convert our work because we had about 95% of the lab completed at that point. We point this out in case of possible silly mistakes or inconsistencies that would not appear in a Keras version. We did our best implementing in PyTorch, and we believe our results are sound enough to signal that our implementation is correct, or mostly so.

Utility Functions/Preprocessing

The following section contains code from the given .ipynb with no modifications. Its intended use is to provide a way to plot images and to load/save model checkpoints.

Vanilla Generative Adversarial Networks

In this implementation of GANs, we will use a few of the tricks from F. Chollet and from Salimans et al. In particular, we will add some noise to the labels.
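The label-noise trick can be sketched as follows: instead of hard 0/1 targets, the discriminator's labels are perturbed with a little uniform noise. The 0.1 noise scale and the helper's name are illustrative assumptions, not the lab's exact values.

```python
import torch

# Sketch of label noise/smoothing (after Salimans et al.): perturb the
# discriminator's hard 0/1 targets with uniform noise so it does not
# become over-confident. noise_scale=0.1 is an assumed value.
def noisy_labels(batch_size, real=True, noise_scale=0.1):
    base = torch.ones(batch_size, 1) if real else torch.zeros(batch_size, 1)
    noise = noise_scale * torch.rand(batch_size, 1)
    # Real targets drift just below 1.0; fake targets just above 0.0
    return base - noise if real else base + noise
```

These noisy targets are then fed to the BCE loss in place of exact ones and zeros.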

[3 points] First, look at the code for the generator and discriminator/critic. Adjust the architecture to also sample from one hot encodings and use embeddings of class encodings (like in the LS-GAN paper). Use this same one hot encoding (both sampled and from the actual classes) in the discriminator/critic. This GAN will be your base implementation--run this for at least 500 epochs (ideally more).

In the code provided to us below, we adjust the vanilla Generator and Discriminator by adding an Embedding layer based on the number of classes in this dataset. Per the code preceding this section, the classes are outlined in the "classes" variable; there are 10 of them, and we use the length of this variable to construct the embedding. In order to connect an embedding layer to the first linear layer, we needed to widen that layer's input by the number of classes we embedded. This makes sense because as we embed the classes, those classes need to be represented within the latent space. Therefore, for everything to connect, the Embedding layer takes (num_classes, num_classes) and the first Linear layer takes an input of latent_dim + num_classes. This logic applies to both the generator and discriminator classes.

In addition to adding an embedding layer, we adjusted the forward method to embed the labels. We then concatenate the random noise (z) with the embedded labels to form a latent vector with embeddings. This is then passed through the network, hence the need to change the input and output dimensions of our generator and discriminator classes.
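A minimal sketch of the conditional generator described above. The hidden-layer sizes are illustrative assumptions; the essential pieces are the `Embedding(num_classes, num_classes)` layer, the widened `Linear` input of `latent_dim + num_classes`, and the concatenation in `forward`.

```python
import math
import torch
import torch.nn as nn

# Sketch of the conditional generator described above. Layer sizes are
# illustrative assumptions; the key changes are the Embedding layer and
# the widened first Linear layer.
class CondGenerator(nn.Module):
    def __init__(self, latent_dim=100, num_classes=10, img_shape=(3, 32, 32)):
        super().__init__()
        self.img_shape = img_shape
        # Each of the num_classes labels maps to a num_classes-dim vector
        self.label_emb = nn.Embedding(num_classes, num_classes)
        self.model = nn.Sequential(
            # First Linear widened to accept noise plus embedded label
            nn.Linear(latent_dim + num_classes, 128),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(128, math.prod(img_shape)),
            nn.Tanh(),
        )

    def forward(self, z, labels):
        # Concatenate noise with the embedded labels, then run the net
        x = torch.cat([z, self.label_emb(labels)], dim=1)
        img = self.model(x)
        return img.view(img.size(0), *self.img_shape)

gen = CondGenerator()
imgs = gen(torch.randn(4, 100), torch.randint(0, 10, (4,)))
```

The discriminator mirrors this: it embeds the label, concatenates it with (a flattened view of) the image, and widens its first layer accordingly.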

In the following sections, we run the Vanilla GAN with OHE capability, the Least Squares GAN (with OHE), and the Wasserstein GAN with Gradient Penalty (also with OHE). These models were run before we were notified to run only one variant, so we include them in our analysis; when implementing the three new features, we built off the Vanilla OHE GAN twice and the WGAN once. Analysis is provided at the end of each section.

In the plot above for our vanilla one-hot encoding model, the Discriminator loss goes down as the number of epochs increases, while the Generator loss increases. It is important to note that there appear to be some outliers in the Discriminator curve, but these do not change the overall downward pattern. The decreasing Discriminator loss suggests that the GAN is becoming better at detecting fake images; the increasing Generator loss suggests that the Generator is not able to produce images that closely resemble real ones. Taken together, there is evidence that the Discriminator is only getting better at detecting images generated by the Generator, and that the Generator is struggling to "fool" the Discriminator. This model took 189 minutes to train fully.

Model Checkpointing

Let's load up a previous run and see how the images evolved over the run. The blocks below use the state_dict property to save and load data. The second block is nice for running this notebook only to show the results of a previous run.
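The save/load pattern looks roughly like this; the file name, checkpoint keys, and the tiny stand-in model are assumptions, since the notebook's actual helper functions may differ in detail.

```python
import torch
import torch.nn as nn

# Illustration of the state_dict checkpointing pattern. The file name,
# checkpoint keys, and stand-in model below are assumptions.
net = nn.Linear(10, 2)
opt = torch.optim.Adam(net.parameters(), lr=2e-4)

# Save everything needed to resume (or just to re-plot a previous run)
checkpoint = {
    "model_state": net.state_dict(),
    "optimizer_state": opt.state_dict(),
    "epoch": 500,
}
torch.save(checkpoint, "checkpoint.pt")

# Later: rebuild the model with the same architecture and restore it
restored = torch.load("checkpoint.pt")
net2 = nn.Linear(10, 2)
net2.load_state_dict(restored["model_state"])
```

Because only tensors and plain metadata are pickled, the second block can run on its own to display a previous run's results without retraining.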

We are getting something that resembles a frog, but we are also seeing a bit of mode collapse. The global properties of a greenish or gray blob surrounded by varied background are starting to come across. However, the finer structure is not doing well: the legs and details in the background are not present yet.

To improve this result, there are a number of things we might try such as:


Least Squares GAN

Actually, the only thing we need to do here is replace the adversarial loss function. Note that we are NOT going to make additions to the architecture where the one hot encoding of the classes (and random classes) are used in both the generator and discriminator. This means that we might see a bit more mode collapse in our implementation.
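The swap can be sketched as follows: the BCE adversarial loss is replaced with mean-squared error against the real/fake targets, per the LSGAN formulation. The `d_out_*` tensors below are stand-ins for actual discriminator outputs.

```python
import torch
import torch.nn as nn

# Sketch of the LSGAN loss swap: BCE is replaced by mean-squared error
# against the real/fake targets. d_out_* are stand-ins for real
# discriminator outputs on a batch of 8.
adversarial_loss = nn.MSELoss()

valid = torch.ones(8, 1)   # target for real images
fake = torch.zeros(8, 1)   # target for generated images

d_out_real = torch.rand(8, 1)  # D's scores on a real batch (stand-in)
d_out_fake = torch.rand(8, 1)  # D's scores on a generated batch (stand-in)

# Discriminator: push real scores toward 1 and fake scores toward 0
d_loss = 0.5 * (adversarial_loss(d_out_real, valid)
                + adversarial_loss(d_out_fake, fake))

# Generator: push the discriminator's scores on fakes toward 1
g_loss = adversarial_loss(d_out_fake, valid)
```

Everything else in the training loop stays the same; only the criterion object changes.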

The Least Squares GAN shows patterns similar to the Vanilla OHE GAN. Discriminator loss decreases over the epochs, suggesting it is getting better at detecting images generated by the Generator. The Generator loss does not necessarily go up over the epochs, but it stays relatively high compared to the Discriminator's, suggesting the Generator is not producing images good enough to fool the Discriminator. This model took 185 minutes; so far this is the fastest model, but only by 4 minutes.

Well, these results are not exactly a great improvement. Mode collapse is more apparent here, and the fine structure of the frogs is also not quite the improvement we wanted. Looking back through the iterations, there was some indication of more successful generations: subjectively, the frogs started to show up, but then generation became slightly worse. We could run this code for many more iterations, and that might help the optimizers find better distributions, but it is not guaranteed.

Instead, let's now try using a Wasserstein GAN, where we use the gradient penalty as a method of making the discriminator 1-Lipschitz (and therefore a valid critic for approximating the earth mover's distance).


Wasserstein GAN with Gradient Penalty

For this implementation, we need to add a gradient penalty term to the Discriminator's loss to make it a critic. For the most part, we need to add the gradient penalty calculations to match WGAN-GP.

This compute_gradient_penalty function for WGAN-GP comes from https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/wgan_gp/wgan_gp.py#L119.
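A sketch of that function, following the linked reference implementation (Gulrajani et al.'s WGAN-GP); label conditioning is omitted here for brevity.

```python
import torch

# Sketch of the WGAN-GP gradient penalty: sample random points between
# real and fake images and penalize critic gradients whose norm deviates
# from 1 (the 1-Lipschitz target). Label conditioning omitted.
def compute_gradient_penalty(critic, real_samples, fake_samples):
    # Random per-sample interpolation coefficient
    alpha = torch.rand(real_samples.size(0), 1, 1, 1)
    interpolates = (alpha * real_samples
                    + (1 - alpha) * fake_samples).requires_grad_(True)
    d_interpolates = critic(interpolates)
    # Gradient of the critic's output w.r.t. the interpolated images
    gradients = torch.autograd.grad(
        outputs=d_interpolates,
        inputs=interpolates,
        grad_outputs=torch.ones_like(d_interpolates),
        create_graph=True,
        retain_graph=True,
        only_inputs=True,
    )[0]
    gradients = gradients.view(gradients.size(0), -1)
    # Penalize deviation of the gradient norm from 1
    return ((gradients.norm(2, dim=1) - 1) ** 2).mean()
```

The resulting scalar is scaled by a penalty coefficient (commonly 10) and added to the critic's Wasserstein loss.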

To round out the three vanilla variants, WGAN with Gradient Penalty follows similar patterns: the Generator loss increases over the epochs while the Discriminator loss decreases. It is important to note that the Discriminator loss can be negative, and often is, because of how the Wasserstein objective is formulated. Regardless of the sign, the decreasing Discriminator loss suggests the critic is becoming better at scoring generated images, while the increasing Generator loss suggests the Generator struggled to generate images that closely match real ones. This model took 226 minutes.

The WGAN-GP seems to be a bit better (on some runs) at finding more diverse backgrounds. You can also notice that one of the runs seems to start finding legs, which is something the other methods struggled with.

Feature Matching

[4 points] Implement one item from the list in the GAN training and generate samples of your dataset images. Explain the method you are using and what you hypothesize will occur with the results. Train the GAN running for at least 500 epochs. Subjectively, did this improve the generated results? Did training time increase or decrease and by how much? Explain.

The first method we implement is feature matching. Feature matching aims to improve Generator performance by encouraging the Generator to produce images that match a set of higher-level features, as opposed to the current state where the Generator is trained purely at the pixel level. It is intended to take the above GANs, where the Generator struggles to fool the Discriminator, and assist it by training on high-level features. The idea is to take features from the Discriminator and use them to train the Generator. We could take these features from any part of the Discriminator, but we keep it simple and use the last-layer features (before the output). Because the Discriminator provides the features, we need a feature loss function for training rather than BCE as in the methods above.
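The resulting generator objective can be sketched as below; the feature extractor passed in, and the choice of MSE as the feature loss, are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the feature-matching objective described above: the generator
# is trained to match the mean of the discriminator's last-layer features
# on real data, instead of maximizing the final real/fake output. The
# extractor and the MSE feature loss are illustrative assumptions.
feature_criterion = nn.MSELoss()

def feature_matching_loss(extract_features, real_imgs, fake_imgs):
    # Mean feature activation over each batch, taken from the layer
    # just before the discriminator's output head
    real_feats = extract_features(real_imgs).mean(dim=0)
    fake_feats = extract_features(fake_imgs).mean(dim=0)
    # Only the generator trains on this loss, so real features are detached
    return feature_criterion(fake_feats, real_feats.detach())
```

Note the loss is zero exactly when the batch-mean feature statistics match, which is a much weaker (and therefore more stable) target than fooling the output head on every image.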

We hypothesize that the losses for the Generator will start to decrease over time while the Discriminator losses will either remain neutral or decrease marginally. Typically, if one loss decreases the other increases. We believe this pattern will occur in this case.

Therefore, with the above explanation on our view of Feature Matching, we made the following changes:

It appears that our hypothesis regarding Generator loss turned out to be correct. As for the Discriminator, the loss decreases significantly for the first 100 epochs but tails off to a consistent loss for the last 400. There is evidence that the Generator has become very good at generating images that fool the Discriminator. There is also evidence that the Discriminator is still very strong at detecting fake images but is consistently fooled by certain generated images, as indicated by the flat loss. We believe this was a successful implementation of feature matching, given our research into how it is supposed to be used and implemented. We do question why the Discriminator leveled off at a consistent loss. One hypothesis is that we should have kept the final output layer but created a separate function that extracts features instead; however, the reference we used suggests not doing this (https://paperswithcode.com/method/feature-matching). In the code shown there, the author comments out the Sigmoid layer but keeps the other layers active, whereas our method also removes the sigmoid layer and the associated Linear layer and extracts features from the convolutions. Dr. Larson, thoughts on this?

This model took 193 minutes.

Historical Averaging

[4 points] Implement ANOTHER item in the list in the GAN training and generate samples of your dataset images. Repeat the previous step. You can add this to your previous implementation or run this in isolation as you prefer.

The second method we implemented is Historical Averaging. Like feature matching, Historical Averaging also aims to improve Generator performance, in this case by averaging the gradients over training, thereby stabilizing the process. We built Historical Averaging on top of feature matching, and therefore we expect it to improve the Generator only, not the Discriminator. To implement it, we take the gradients of the Generator each loop and maintain a running average of them. We then reapply these averaged gradients to the Generator after backpropagation, so that the next epoch uses these weight changes to generate sample images.
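For reference, the standard formulation from Salimans et al. keeps a running average of past parameter values and penalizes the distance from it. This sketch follows that formulation rather than our exact gradient-averaging variant, and all names are illustrative.

```python
import torch

# Standard historical averaging (Salimans et al.): keep a running average
# of past parameter values and add a penalty ||theta - theta_avg||^2 to
# the loss. This follows the paper's parameter-averaging formulation,
# not our gradient-averaging variant; names are illustrative.
class HistoricalAverage:
    def __init__(self, model):
        self.count = 1
        self.avg = [p.detach().clone() for p in model.parameters()]

    def penalty(self, model):
        # Squared distance of current parameters from the running average
        return sum(((p - a) ** 2).sum()
                   for p, a in zip(model.parameters(), self.avg))

    def update(self, model):
        # Incremental running mean over all parameter snapshots so far
        self.count += 1
        for p, a in zip(model.parameters(), self.avg):
            a.add_((p.detach() - a) / self.count)
```

In a training loop, `penalty(...)` would be added to the generator loss before `backward()`, and `update(...)` called once per step.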

We hypothesize that the losses for the Generator will marginally increase over epochs because Feature Matching already vastly improved these. Additionally, we hypothesize that the Discriminator loss will not change compared to Feature Matching.

With the above explanation on our view of Historical Averaging, we made the following changes:

There was marginal improvement, if any, in the Generator loss. This aligns with our hypothesis that the historical averages would have little impact. The Discriminator losses were similar, as we made no modifications to the Discriminator class.

One change we would make, were we to do this again, is to run Historical Averaging in isolation. Although the Generator loss decreased and the Discriminator loss is also solid, there is no evidence that Historical Averaging improved performance; feature matching appears to be the core contributor to the improved performance of these GANs.

This model took 960 minutes. Yikes.

Spectral Normalization for Wasserstein

[4 points] Implement ANOTHER item in the list in the GAN training and generate samples of your dataset images. Repeat the previous step. You can add this to your previous implementation or run this in isolation as you prefer.

Our final method of implementation is Spectral Normalization. As we understand it, the idea is to normalize each weight matrix by its spectral norm. We were confident this needed to occur within the convolutional network, though if we had a larger sequential non-convolutional network, we could probably apply the method there as well. Through our research, we determined this is supposed to help the overall convergence of the training.
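PyTorch provides this directly via `torch.nn.utils.spectral_norm`, which rescales a wrapped layer's weight by an estimate of its largest singular value on each forward pass. A sketch of wrapping a critic's convolutional layers follows; the layer sizes are illustrative assumptions, not our exact architecture.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Sketch of applying spectral normalization to a critic's conv layers:
# spectral_norm divides each wrapped weight by an estimate of its
# largest singular value on every forward pass. Sizes are illustrative.
critic = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 16, 3, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
    spectral_norm(nn.Conv2d(16, 32, 3, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 1),  # 32x32 input downsampled twice -> 8x8
)

scores = critic(torch.randn(4, 3, 32, 32))
```

Because the constraint is enforced per layer, no gradient penalty term is strictly required when every critic layer is wrapped, though the two can be combined.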

Our hypothesis, even though this only directly affects the Critic (Discriminator), is that the losses will start to level out for both networks. We believe this because we are essentially normalizing weights, so we would expect the losses to look less exponential than in the WGAN above, and flatter.

Changes we made:

The overall magnitudes of both losses decreased: we do not see losses nearly as large in either the Generator or the Discriminator compared to the original WGAN. However, we cannot say our hypothesis was supported, as both curves still appear exponential in shape. One could argue that our implementation of Spectral Normalization hardly improved upon the original WGAN above. The conclusion for this WGAN follows the original: Generator loss increases, suggesting the Generator is unable to generate sample images similar to real ones, while the Discriminator is able to correctly distinguish real and generated images.

This model took 249 minutes.

Final Analysis

Below we show all charts side by side as well as the times for each and give final thoughts on each GAN.

Overall, the best-performing GAN based on both Discriminator and Generator loss is the Vanilla OHE with feature matching. The Vanilla Historical Averaging run took the longest, which is somewhat expected given the way we computed averages (every epoch). This could be improved time-wise with other averaging methods or a better algorithm. Note: we ran this twice to make sure it was actually taking 15 hours, and it did both times. Setting aside this outlier, the Vanilla LS GAN was the quickest at 185 minutes, but the best-performing model took only about 8 minutes longer, at 193 minutes total.

Compared to our baseline model, the feature matching GAN significantly outperformed it by our metric of loss and was comparable in terms of time. We are confident in saying that the feature matching implementation improved our GAN significantly.